(54) Method and system for end-to-end problem determination and fault isolation for storage
area networks



(57) A method and system for problem determina- 
tion and fault isolation in a storage area network (SAN) 
is provided. A complex configuration of multi-vendor 
host systems, FC switches, and storage peripherals are 
connected in a SAN via a communications architecture 
(CA). A communications architecture element (CAE) is 
a network-connected device that has successfully reg- 
istered with a communications architecture manager 
(CAM) on a host computer via a network service proto- 
col, and the CAM contains problem determination (PD) 
functionality for the SAN and maintains a SAN PD information
table (SPDIT). The CA comprises all network-connected
elements capable of communicating information
stored in the SPDIT. The CAM uses a SAN topology
map and the SPDIT to create a SAN
diagnostic table (SDT). A failing component in a partic- 
ular device may generate errors that cause devices 
along the same network connection path to generate er- 
rors. As the CAM receives error packets or error mes- 
sages, the errors are stored in the SDT, and each error 
is analyzed by temporally and spatially comparing the 
error with other errors in the SDT. If a CAE is determined
to be a candidate for generating the error, then the CAE 
is reported for replacement if possible. 




Printed by Jouve, 75001 PARIS (FR) 






EP 1 115 225 A2






Description 

[0001] The present invention relates to an improved 
data processing system and, in particular, to a method 
and apparatus for computer network management.
[0002] A Storage Area Network (SAN) is an "open 
system" storage architecture that allows multiple host 
computers to share multiple storage peripherals, and in 
particular, to share storage peripherals via a Fibre Chan- 
nel (FC) network switch. The FC switch, host systems, 
and storage peripherals may be manufactured by differ- 
ent vendors and contain different operating environ- 
ments. 

[0003] Currently, there is a lack of an end-to-end prob- 
lem determination capability or specification for an FC 
SAN. A complex configuration of multi-vendor systems, 
network switches, and peripherals makes it significantly 
more difficult to perform problem determination in a SAN 
environment than in existing point-to-point storage configurations.
As a result, failures in a SAN environment
increase both system downtime and the cost of system maintenance.
[0004] Accordingly, the invention provides a method 
for processing errors within a storage area network 
(SAN), the method comprising the computer-imple- 
mented steps of: generating a SAN topology map; gen- 
erating a SAN problem determination information table 
(SPDIT); and generating a SAN diagnostic table (SDT) 
using the SAN topology map and the SPDIT. 
[0005] The invention further provides a data process- 
ing system for communicating error information in a stor- 
age area network (SAN), the data processing system 
comprising: a network comprising in-band Fibre Channel
communication links and out-of-band communication
links, wherein the network supports a communications
architecture (CA); a plurality of storage devices
connected to the network; a plurality of host computers 
connected to the network, wherein at least one of the 
plurality of host computers comprises a communica- 
tions architecture manager (CAM) containing problem 
determination (PD) functionality, wherein a CAM main- 
tains a SAN PD information table (SPDIT), and wherein 
the CA comprises all network-connected elements capable
of communicating information stored in the SPDIT.

[0006] The invention still further provides a data 
processing system for processing errors within a stor- 
age area network (SAN), the data processing system 
comprising: first generating means for generating a SAN 
topology map; second generating means for generating 
a SAN problem determination information table (SP- 
DIT); and third generating means for generating a SAN 
diagnostic table (SDT) using the SAN topology map and 
the SPDIT. 

[0007] The invention still further provides a computer 
program comprising program code for controlling a data 
processing system to perform operations implementing 
a method for processing errors within a storage area 



network (SAN), the program code comprising: first in- 
structions for generating a SAN topology map; second 
instructions for generating a SAN problem determination
information table (SPDIT); and third instructions for
generating a SAN diagnostic table (SDT) using the SAN
topology map and the SPDIT.

[0008] According to a preferred embodiment, a meth- 
od and system for problem determination and fault iso- 
lation in a storage area network (SAN) is provided. 
[0009] Preferably, a method and apparatus are pro- 
vided that define an "open system", real-time, end-to- 
end, error detection architecture that incorporates fault 
isolation algorithms to identify failing systems and/or 
components connected to a SAN. 
[0010] According to a preferred embodiment, a com- 
plex configuration of multi-vendor host systems, FC 
switches, and storage peripherals are connected in a 
SAN via a communications architecture (CA). A com- 
munications architecture element (CAE) is a network- 
connected device that has successfully registered with 
a communications architecture manager (CAM) on a 
host computer via a network service protocol, and the 
CAM contains problem determination (PD) functionality 
for the SAN and maintains a SAN PD information table 
(SPDIT). 

[0011] The CA preferably comprises all network-connected
elements capable of communicating information
stored in the SPDIT. The CAM uses a SAN topology map 
and the SPDIT to create a SAN diagnostic table (SDT). 
A failing component in a particular device may generate 
errors that cause devices along the same network con- 
nection path to generate errors. As the CAM receives 
error packets or error messages, the errors are prefer- 
ably stored in the SDT, and each error is analyzed by 
temporally and spatially comparing the error with other 
errors in the SDT. If a CAE is determined to be a candi- 
date for generating the error, then the CAE is reported 
for replacement if possible. 

[0012] A preferred embodiment of the present inven- 
tion will now be described, by way of example only, with 
reference to the following drawings: 

Figure 1 is a pictorial representation depicting a da- 
ta processing system in which the preferred embod- 
iment of the present invention is implemented; 

Figure 2 is an example block diagram illustrating 
internal components of a server-type data process- 
ing system that implements the present invention 
according to a preferred embodiment; 

Figure 3 is a diagram depicting a communications 
architecture for data processing systems that par- 
ticipate in the SAN problem determination method- 
ology implemented in accordance with a preferred 
embodiment of the present invention; 

Figure 4 is a table depicting a SAN Problem Determination
Information Table (SPDIT) in accordance
with a preferred embodiment of the present invention;

Figure 5A is a simplified network topology diagram 
for a SAN in accordance with a preferred embodi- 
ment; 

Figure 5B is a table providing a topology map for 
the SAN shown in Figure 5A in accordance with a 
preferred embodiment of the present invention; 

Figure 6 is a diagram depicting a SAN Diagnostic 
Table for a SAN in accordance with a preferred em- 
bodiment of the present invention; 

Figure 7 is a table depicting the weightings to be 
used in real-time diagnostic analysis for various er- 
rors in accordance with a preferred embodiment of 
the present invention; and 

Figures 8A-8D are flowcharts depicting a process 
for a real-time diagnostic algorithm for SAN end-to- 
end fault isolation of a single failing SAN element in 
accordance with a preferred embodiment of the 
present invention. 

[0013] With reference now to Figure 1, a pictorial representation
depicts a data processing system in which
a preferred embodiment of the present invention is im- 
plemented. A computer 100 is depicted, which includes 
a system unit 110, a video display terminal 102, a key- 
board 104, storage devices 108, which may include flop- 
py drives and other types of permanent and removable 
storage media, and mouse 106. Additional input devices
may be included with computer 100. Computer 100 can 
be implemented using any suitable computer, for exam- 
ple, an IBM RISC/System 6000 system, a product of In- 
ternational Business Machines Corporation in Armonk, 
New York, running the Advanced Interactive Executive 
(AIX) operating system, also a product of IBM Corpora- 
tion. Although the depicted representation shows a 
server-type computer, other embodiments of the 
present invention are implemented in other types of data 
processing systems, such as workstations, network 
computers, Web-based television set-top boxes, Inter- 
net appliances, etc. Computer 100 also preferably in- 
cludes a graphical user interface that is implemented by 
means of system software residing in computer reada- 
ble media in operation within computer 100. 
[0014] Figure 1 is intended as an example and not as 
an architectural limitation for the present invention. 
[0015] With reference now to Figure 2, a block dia- 
gram depicts a typical organization of internal compo- 
nents in a data processing system. Data processing sys- 
tem 200 employs a variety of bus structures and proto- 
cols. Although the depicted example employs a PCI bus, 
an ISA bus, and a 6XX bus, other bus architectures and 



protocols may be used. 

[0016] Processor card 201 contains processor 202 
and L2 cache 203 that are connected to 6XX bus 205.
System 200 may contain a plurality of processor cards. 
Processor card 206 contains processor 207 and L2
cache 208. 

[0017] 6XX bus 205 supports system planar 210 that 
contains 6XX bridge 211 and memory controller 212 that
supports memory card 213. Memory card 213 contains
local memory 214 consisting of a plurality of dual in-line
memory modules (DIMMs) 215 and 216.
[0018] 6XX bridge 211 connects to PCI bridges 220 
and 221 via system bus 222. PCI bridges 220 and 221 
are contained on native I/O (NIO) planar 223 which supports
a variety of I/O components and interfaces. PCI
bridge 221 provides connections for external data 
streams through network adapter 224 and a number of 
card slots 225-226 via PCI bus 227. PCI bridge 220 connects
a variety of I/O devices via PCI bus 228. Hard disk
229 may be connected to SCSI host adapter 230, which
is connected to PCI bus 228. Graphics adapter 231 may 
also be connected to PCI bus 228 as depicted, either 
directly or indirectly. 

[0019] ISA bridge 232 connects to PCI bridge 220 via
PCI bus 228. ISA bridge 232 provides interconnection
capabilities through NIO controller 233 via ISA bus 234, 
such as serial connections 235 and 236. Floppy drive 
connection 237 provides removable storage. Keyboard 
connection 238 and mouse connection 239 allow data
processing system 200 to accept input data from a user.
Non-volatile RAM (NVRAM) 240 provides non-volatile 
memory for preserving certain types of data from system 
disruptions or system failures, such as power supply 
problems. System firmware 241 is also connected to
ISA bus 234 and controls the initial BIOS. Service processor
244 is connected to ISA bus 234 and provides
functionality for system diagnostics or system servicing. 
[0020] Service processor 244 detects errors and 
passes information to the operating system. The source
of the errors may or may not be known to a reasonable
certainty at the time that the error is detected. The op- 
erating system may merely log the errors or may other- 
wise process reported errors. 

[0021] Those of ordinary skill in the art will appreciate 
that the hardware in Figure 2 may vary depending on
the system implementation. For example, the system 
may have more processors, and other peripheral devic- 
es may be used in addition to or in place of the hardware 
depicted in Figure 2. The depicted examples are not 
meant to imply architectural limitations with respect to
the present invention. 

[0022] With reference now to Figure 3, a diagram de- 
picts a communications architecture for data processing 
systems that participate in the SAN problem determination
methodology implemented in accordance with a
preferred embodiment of the present invention. Network 
300 comprises a set of computers, switches, and storage
devices that may or may not participate in the communications
architecture.

[0023] The Communications Architecture (CA) comprises
all SAN-connected elements capable of communicating
any or all of the information defined in a SAN
Problem Determination Information Table (SPDIT), 
which is described in more detail further below. 
[0024] Each SAN connected element participating in 
the CA is called a CA Element (CAE). Any element not 
participating in the CA is called a CA Non-participant 
(CAN). These elements are distinguished because they 
both participate in the SAN topology and thereby the 
problem determination (PD) capabilities of the system.
Windows NT™ server 302, mainframe computer 304,
Unix™ server 306, and Linux™ server 308 are computers
that participate in the CA and are thus CAEs. Windows
NT™ server 302, mainframe computer 304, and
Unix™ server 306 are also host computers that may 
support various clients, which may require access to the 
storage devices. Each of computers 302-306 has a Host 
Bus Attach (HBA), which is a type of network adapter 
for FC hosts. FC switches 311-313 are CAEs, and some
of the storage devices are also CAEs. In the example,
shared RAIDs (Redundant Arrays of Independent Disks)
321-323 and shared tape 324 are CAEs, while shared 
tape 325 is a CAN. 

[0025] The CA can communicate via the FC switching 
fabric via in-band communication links 341-352 using 
the TCP/IP protocol and/or via an out-of-band TCP/IP 
communication network on communication links 
331-334 that all SAN elements share. It should be noted
that the communication links depicted in Figure 3 may 
be logical connections that share a single physical con- 
nection. Alternatively, the devices may be connected by 
more than one physical communication link. 
[0026] The protocols used by the CA to issue and/or 
collect information are defined to be both SNMP/MIB 
(Simple Network Management Protocol/Management 
Information Base, an SNMP structure that describes the 
particular device being monitored) and native FC based. 
The use of these two protocols allows both device/host- 
specific and SAN-specific information to be collected 
and subsequently used for end-to-end problem deter- 
mination. 

[0027] CA Managers (CAMs) are special CAEs in 
which the end-to-end PD capabilities of the system re- 
side. The SPDIT resides in the CAM and every CAE is 
automatically registered with a CAM (via native FC and/ 
or SNMP services). CAEs are those elements that suc- 
cessfully register, and CANs are those elements that 
cannot register with the CAM but are known to be 
present via the SAN topology discovery process, which 
is discussed in more detail further below. CAMs support
any FC Extended Link Services (ELS) that are relevant 
to end-to-end problem determination. 
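
The CAE/CAN distinction described above can be illustrated with a minimal sketch. It assumes a hypothetical `SanElement` record whose `can_register` flag stands in for a successful native FC or SNMP registration; neither name comes from the specification, and real registration would involve the network services themselves.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SanElement:
    name: str
    # Hypothetical flag: True when the element registered with the CAM
    # through native FC or SNMP services.
    can_register: bool


def classify_elements(discovered):
    """Split discovered SAN elements into CAEs and CANs.

    Every element in `discovered` is assumed to be known from topology
    discovery; only those that registered count as CAEs.
    """
    caes = [e.name for e in discovered if e.can_register]
    cans = [e.name for e in discovered if not e.can_register]
    return caes, cans


discovered = [
    SanElement("shared tape 324", True),
    SanElement("shared tape 325", False),
    SanElement("FC switch 311", True),
]
caes, cans = classify_elements(discovered)
```

With the devices of Figure 3 used as sample data, shared tape 325 ends up in the CAN list while the other two elements are CAEs.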
[0028] CAMs may be categorized as a primary or active
CAM and secondary or inactive CAMs. CAMs are
highly available elements that replicate SPDIT and reg- 
istration information. For example, a secondary CAM 



and a primary CAM may share a heartbeat signal so that 
a secondary CAM, operating in a redundant manner, 
may assume the duties of the primary CAM if the primary 
CAM appears to have failed by not responding to the 
heartbeat signal. The problem determination interface
to the CAM is comprised of a SAN PD Application Pro- 
gramming Interface (SAN PD API). The SAN PD API 
defines the communication interface between the CAM 
and any other operating environment that can read CAM 
information or status.
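
The heartbeat-based takeover between a primary and a secondary CAM might be sketched as follows. The class name, the timeout parameter, and the choice to ship an SPDIT snapshot with each heartbeat are all assumptions made for illustration; the specification only states that a heartbeat is shared and that SPDIT and registration information is replicated.

```python
class SecondaryCam:
    """Hypothetical secondary CAM that watches the primary's heartbeat."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s      # assumed silence limit in seconds
        self.last_heartbeat = None      # time of the last heartbeat seen
        self.active = False             # True once this CAM has taken over
        self.spdit_replica = {}         # replicated SPDIT contents

    def on_heartbeat(self, now, spdit_snapshot):
        # Record the heartbeat and refresh the replica in one step.
        self.last_heartbeat = now
        self.spdit_replica = dict(spdit_snapshot)

    def check(self, now):
        # Assume the primary has failed when it stays silent too long.
        if self.last_heartbeat is not None and now - self.last_heartbeat > self.timeout_s:
            self.active = True
        return self.active
```

Times are passed in explicitly rather than read from a clock so that the takeover rule can be exercised deterministically.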

[0029] With reference now to Figure 4, a table depicts 
a SAN Problem Determination Information Table (SP- 
DIT) in accordance with a preferred embodiment of the 
present invention. The SPDIT is comprised of all known 
products/elements and the information types that can
be communicated on the CA. The format of the SPDIT 
may vary depending upon the number of devices in the 
CA, the type of products that are supported, the information
associated with the devices, etc. For example,
the SPDIT would contain information concerning each
device shown in Figure 3. 

[0030] SPDIT 400 contains, by way of example, the 
following record entries: 

vendor attribute 401, product identifier 402, info type
403, and description 404. Each record in SPDIT 400
contains data for these record entries. Vendor attribute 
401 contains the manufacturer of a particular device on 
the CA. Product identifier 402 contains vendor-assigned 
information for identifying a particular device, such as 
model type, model number, product serial number, etc.
[0031] Information type 403 contains data related to 
the type of communication links supported by the de- 
vice, the format of error conditions or error definitions 
supported by the device, etc. Description attribute 404 
provides information about the type of error information
that should be expected by the product. For example, if 
the description attribute record only contains an indica- 
tion that the product is ELS Registered Link Incident 
Record (RLIR) compatible, then a CAM-related process 
would not expect to receive out-of-band MIBs for the
product. 
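
One way to picture an SPDIT record is as a four-field structure mirroring entries 401-404. The sketch below is illustrative only: the field names, the sample values, and the rule that a description of exactly "ELS RLIR" marks an RLIR-only product are assumptions, not a format defined by the specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpditRecord:
    vendor: str        # vendor attribute 401
    product_id: str    # product identifier 402
    info_type: str     # information type 403
    description: str   # description attribute 404


def expects_out_of_band_mibs(record):
    # Assumption: a description of exactly "ELS RLIR" means the product
    # reports only in-band RLIR events, so no out-of-band MIBs are expected.
    return record.description != "ELS RLIR"


rlir_only = SpditRecord("ExampleVendor", "model-X", "in-band FC", "ELS RLIR")
mixed = SpditRecord("ExampleVendor", "model-Y", "in-band FC, SNMP", "ELS RLIR, MIB")
```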

[0032] The SPDIT will generally contain all information
used to indicate status/error conditions by SAN-capable
peripherals, hosts, and switches. This would include
native FC link and extended link error definitions,
and MIB definitions. These definitions can include field 
replaceable unit (FRU) component information, which 
can be located in a MIB or embedded in the error reporting
protocol and can be used to determine the granularity
to which failing components can be isolated.

[0033] As noted previously, CAMs are special CAEs in
which the end-to-end PD capabilities of the system re- 
side, including the SPDIT. The CAM initialization proc- 
ess includes the discovery and registration of all FC 
nodes connected to both the in-band SAN and out-of-
band network. The CAM initialization process uses FC 
in-band and CA out-of-band (via SNMP) discovery/reg- 
istration processes. This process provides a topology 
map (TM) of the SAN that includes all registered and
non-registered SAN connected elements along with 
knowledge of the element types (hosts, peripherals, 
switches), explicit connections/paths, and their relevant 
vendor and SPDIT information.
[0034] With reference now to Figure 5A, a simplified
network topology diagram for the SAN in accordance 
with a preferred embodiment is shown. FC switch 501 
contains ports 511-513 providing connection points between
FC switch 501 and CAEs 521-523, also labelled
CAE A, CAE B, and CAE C. From the perspective of the 
CA, FC switch ports 511-513 are CAEs because the 
ports are capable of failing or generating errors and 
could be replaced after being properly diagnosed as a 
source of errors.
[0035] With reference now to Figure 5B, a table pro- 
vides a topology map for the SAN shown in Figure 5A 
in accordance with a preferred embodiment of the 
present invention. The TM is represented as a two-dimensional
table with both the left column and the top
row containing the SAN elements, both CAE and CAN 
devices, connected to the switch, such as FC switch 501 
in Figure 5A. The diagonal cells contain all the SPDIT/
type information about the corresponding element and 
the switch port to which it is connected. The other cells
contain the directional paths between the elements. For 
example, the table shows the direction path between 
CAE A and CAE C using the path between ports 3 and 
1. Multiple paths are possible. The topology and registration
discovery processes are periodically repeated to
ensure that the TM is current. The CAM will also register 
with any SAN elements providing Extended Link Serv- 
ices that can be used for PD. 
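
The two-dimensional topology map can be sketched as a nested dictionary in which diagonal cells hold element information and off-diagonal cells hold directional port paths. The port assignment used below (CAE A on port 3, CAE B on port 2, CAE C on port 1) is an assumption chosen so that the A-to-C path reads "ports 3 and 1" as in the text; Figure 5B itself is not reproduced here.

```python
def build_topology_map(ports):
    """Build a TM from a mapping of element name to attached switch port.

    Diagonal cells describe the element and its port; every other cell
    holds the directional (source port, destination port) path.
    """
    names = list(ports)
    tm = {}
    for src in names:
        tm[src] = {}
        for dst in names:
            if src == dst:
                tm[src][dst] = {"element": src, "port": ports[src]}
            else:
                tm[src][dst] = (ports[src], ports[dst])
    return tm


# Assumed port assignment for the elements of Figure 5A.
tm = build_topology_map({"CAE A": 3, "CAE B": 2, "CAE C": 1})
```

Because paths are directional, `tm["CAE A"]["CAE C"]` and `tm["CAE C"]["CAE A"]` are stored as separate cells.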

[0036] With reference now to Figure 6, a diagram depicts
a SAN Diagnostic Table for a SAN in accordance
with a preferred embodiment of the present invention. 
The TM of a SAN is used to create a SAN Diagnostic 
Table (SDT) that is used for First Error Data Collection 
(FEDC) and Real-time Diagnostic Analysis (RDA). The 
SDT shown in Figure 6 is similar to the TM shown in
Figure 5B except that it contains an extra row for each 
switch/fabric element. 

[0037] The diagonal SDT cells are used to hold the 
errors reported by the CAE corresponding to its row/column,
including switch ports. Each point in a path, i.e.
SDT cell, represents another SAN-connected element. 
Each cell contains the information collected in the TM
so that specific product behaviors are known and proper 
diagnostic decisions can be made. For example, diagnostic
queries may be presented, such as whether it is
more likely that a storage device is causing a link error 
versus a Host Bus Attach (HBA) if out-of-band SCSI de- 
vice errors accompany in-band HBA FC link errors. 
[0038] The exemplary error information contained in 
Figure 6 illustrates the utility of RDA using the SDT. Row
1 indicates that CAE A has reported an in-band FC link 
timeout. Row 3 indicates an out-of-band hardware controller
error on CAE C. These two errors are related because
they occurred in the same time frame, as shown
by the timestamps associated with the error information. 
Row 5 indicates that an in-band FC link error has oc- 
curred, but given the stored timestamp, the error in row 
5 is unrelated to the previous two. Therefore, the table 
depicts two separate problems: the first is related to a 
controller hardware failure in CAE C, and the second is 
a FC Link failure on CAE 2 in the FC Switch. 
[0039] With reference now to Figure 7, a table depicts 
the weightings to be used in real-time diagnostic analy- 
sis for various errors in accordance with a preferred em- 
bodiment of the present invention. The RDA algorithms 
traverse error reporting elements of the SDT whenever 
an FEDC event occurs in order to determine the appro- 
priate response. The RDA uses weighted decision anal- 
ysis in order to isolate the failing component. Two broad 
categories are illustrated with H=Highest, M=Middle, 
L=Lowest weighting. The SDT traversal algorithms and 
error weightings are dynamic and would be changed to 
accommodate the complexity of the SAN topology and 
the nature of its connected elements. 
[0040] The weighting table shown in Figure 7 pro- 
vides a simple illustration of the strong-to-weak weight- 
ing scale that applies to a typical SAN environment. If 
the SAN grows to just a few 16-port switches with their
associated hosts and peripherals, the number of possi- 
ble nodes that can report errors due to a single disk drive 
error or HBA timeout error can grow to a large number. 
Without global end-to-end RDA diagnostic capability, 
the task of isolating the failing component becomes hit- 
or-miss. In a multi-vendor SAN, it is common for multiple 
intermittent, recoverable, device errors, i.e. soft errors, 
to go unnoticed by the host. Eventually, the device may 
encounter an unrecoverable error, i.e. a hard error, that 
results in a system crash. The in-band and out-of-band 
mechanisms provided by the preferred embodiment 
would detect and report the recoverable errors as soon 
as they occur. 

[0041] With reference now to Figures 8A-8D, flow- 
charts depict a process for a real-time diagnostic algo- 
rithm (RDA) for SAN end-to-end fault isolation of a single 
failing SAN element in accordance with a preferred em- 
bodiment of the present invention. The RDA uses two 
dynamic mechanisms to isolate faults: 

1. Temporal Correlation Window (TCW) - The TCW
is a scalar value, i.e. a time range, used to constrain
fault isolation searching of the SDT in the time dimension
so that the probability of misdiagnosis is
minimized in the time dimension.

2. Spatial Correlation Path (SCP) - The SCP is a 
data structure that is used to constrain fault isolation 
searching in the spatial domain of the SDT so that 
only known system-to-subsystem associations are 
scrutinized and so that the probability of misdiagno- 
sis is minimized in the spatial dimension. The SCP 
copies elements from the SDT during the RDA. 
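
The two mechanisms above reduce to two simple predicates: one in time and one in space. The following sketch assumes timestamps expressed in seconds and a path represented as a tuple of element names; both representations and both function names are hypothetical.

```python
def within_tcw(t_new, t_old, tcw_seconds):
    """True when two error timestamps fall inside the same TCW."""
    return abs(t_new - t_old) <= tcw_seconds


def on_same_scp_path(path, reporter_a, reporter_b):
    """True when both reporting elements lie on the given SCP path."""
    return reporter_a in path and reporter_b in path


# Hypothetical path from CAE A to CAE C through two switch ports.
path = ("CAE A", "port 3", "port 1", "CAE C")
```

Two errors are candidates for correlation only when both predicates hold.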
[0042] The goal of the RDA is to correlate all fault information
received in time, location, and severity until
the source of the fault is isolated with a high degree of 
certainty. This process terminates after a single reported 
fault or after a series of reported faults. 
[0043] The general RDA for SAN end-to-end fault iso- 
lation of a single failing SAN element is described as 
follows. The process begins when a CAM initializes all 
connected paths in the CA into the SDT (step 801). The 
SDT is initialized with all connected paths, i.e., paths
A->B, B->C, etc. Only those paths that should be able to
make connections are entered. These paths are estab- 
lished by the topology mapping, such as a TM similar to 
the TM shown in Figure 5B. The SAN may not be fully
connected in order to zone off certain connections that 
should not be able to connect. For example, certain 
hosts might be restricted to storing and retrieving data 
on particular storage devices. A system administrator 
may only allow NT hosts to store data on a particular 
device so that a mainframe does not have the ability to 
corrupt or destroy the NT data. 
[0044] The process continues with a CAM initializing 
the TCW and SCP for the SAN (step 802). The TCW is 
a time window and requires a time value, usually on the 
order from seconds to minutes. The SCP contains all 
sets of paths chosen from the SDT. These paths reflect
known host-to-storage, host-to-host, and storage-to- 
storage associations that are established by the topol- 
ogy mapping. Again, it should be noted that a secondary 
CAM maintains a replica of the data structures and val- 
ues that are stored in the primary CAM. 
[0045] The CAM then receives a new error (step 803)
and processes the error using the RDA (step 804). A 
determination is then made as to whether the RDA process
is being terminated (step 805), and if not, the process
ess then loops back to step 803 to receive and process 
more errors. If so, then the process of initializing for SAN 
end-to-end fault isolation is complete. 
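
The outer loop of Figure 8A (steps 801-805) can be sketched as below. The function signature, the dictionary used for state, the default TCW of 60 seconds, and the `should_stop` callback that models the termination check are assumptions; the per-error processing of step 804 is passed in as a callable rather than implemented here.

```python
def run_rda(paths, errors, process_error, should_stop, tcw_seconds=60.0):
    """Drive the RDA loop over an iterable of incoming errors."""
    state = {
        "sdt": {p: [] for p in paths},   # step 801: SDT seeded with connected paths
        "scp": list(paths),              # step 802: SCP initialised from the SDT paths
        "tcw": tcw_seconds,              # step 802: TCW time value
    }
    for error in errors:                 # step 803: receive a new error
        process_error(state, error)      # step 804: process it with the RDA
        if should_stop(state):           # step 805: terminate or loop back
            break
    return state


log = []
state = run_rda(
    paths=[("A", "B")],
    errors=["e1", "e2", "e3"],
    process_error=lambda s, e: log.append(e),
    should_stop=lambda s: len(log) >= 2,
)
```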
[0046] Referring now to Figure 8B, a process depicts 
the processing of a new error, such as step 804 in Fig- 
ure 8A, in more detail in accordance with a preferred 
embodiment. The process begins by receiving a new er- 
ror (step 810), and the SDT is updated to indicate the 
component reporting the error, the time the error oc- 
curred, and the severity (high, medium, low) of the error 
(step 811). A determination is made as to whether the 
error is a high severity error (step 812). If so, then this 
error is immediately reported as a fault that requires 
maintenance (step 813). The SPDIT is then interrogated
to determine if the reported error is associated with a 
specific part that should be replaced (step 814). If not,
then the processing of the high severity error is com- 
plete. If so, then the failing component is called out to 
be replaced (step 815), and the processing of the high 
severity error is complete. 

[0047] If the error is not a high severity error, then a 
determination is made as to whether the error is a medium
or low severity error (step 816). If so, then the low/medium
severity error is processed (step 817), and the
error processing is complete. 

[0048] If the error is neither a high severity error nor a
low/medium severity error, then the error severity is determined
to be faulty and the error is ignored (step 818).
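
The severity dispatch of Figure 8B (steps 810-818) might look like the following sketch. The error dictionary layout, the `spdit_parts` lookup from error code to replaceable part, and the tuple-based action list are hypothetical conveniences, not structures defined by the specification.

```python
def process_new_error(sdt, spdit_parts, error, handle_low_medium):
    """Dispatch one error by severity.

    `error` is assumed to be a dict with 'element', 'time', 'severity'
    and 'code' keys; `spdit_parts` maps an error code to a part name.
    """
    sdt.setdefault(error["element"], []).append(error)   # step 811
    severity = error["severity"]
    if severity == "high":                                # step 812
        actions = [("maintenance", error["element"])]     # step 813
        part = spdit_parts.get(error["code"])             # step 814
        if part is not None:
            actions.append(("replace", part))             # step 815
        return actions
    if severity in ("medium", "low"):                     # step 816
        return handle_low_medium(sdt, error)              # step 817
    return []                                             # step 818: invalid severity, ignored


sdt = {}
high = {"element": "CAE C", "time": 1.0, "severity": "high", "code": "E1"}
actions = process_new_error(sdt, {"E1": "controller FRU"}, high, lambda s, e: [])
```

Note that, as in the flowchart, the SDT is updated before the severity is examined, so even an ignored error leaves a record in the table.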
[0049] Referring now to Figure 8C, a process depicts 
the processing of a new low/medium severity error, such 
as step 817 in Figure 8B, in accordance with a preferred 
embodiment of the present invention, in more detail. The
SCP is used to determine the paths that can be affected
by the reported error. Each of the SDT cells for the ele- 
ments in these paths, including the element reporting 
the new error, are interrogated in turn for previous oc- 
currences of errors (step 820), and it is determined if the
occurrence of a previous error is spatially related to the
current error (step 821). The interrogation then uses the 
TCW in order to determine if the occurrence of a previ- 
ous error is related to the current error in time as well 
as space (step 822). If the previous errors are temporally
and spatially related, then the errors are stored into the
SCP (step 823). After the interrogation is finished, the 
SCP contains the mapping of all errors on the appropri- 
ate paths in the SDT that occur within the time con- 
straint. 
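
The interrogation of Figure 8C (steps 820-823) can be sketched as a filter over the SDT. The sketch assumes SCP paths are tuples of element names and that the SDT maps each element to a list of error dicts with 'element' and 'time' keys; the spatial test of step 821 is satisfied by construction because only elements on affected paths are visited.

```python
def collect_related_errors(sdt, scp_paths, new_error, tcw_seconds):
    """Return stored errors related to `new_error` in both space and time."""
    reporter = new_error["element"]
    # Paths that the reporting element lies on, i.e. paths the error can affect.
    affected = [p for p in scp_paths if reporter in p]
    elements = {el for p in affected for el in p} | {reporter}
    related = []
    for el in sorted(elements):                                       # step 820
        for prev in sdt.get(el, []):                                  # step 821 (spatial)
            if abs(new_error["time"] - prev["time"]) <= tcw_seconds:  # step 822 (temporal)
                related.append(prev)                                  # step 823
    return related


sdt = {
    "CAE A": [{"element": "CAE A", "time": 100.0}],
    "CAE B": [{"element": "CAE B", "time": 103.0}],
    "CAE C": [{"element": "CAE C", "time": 105.0}],
    "port 1": [{"element": "port 1", "time": -1000.0}],
}
scp_paths = [("CAE A", "port 3", "port 1", "CAE C"), ("CAE B", "port 2")]
related = collect_related_errors(sdt, scp_paths, sdt["CAE C"][0], 60.0)
```

In this sample the CAE B error is excluded spatially and the old port 1 error is excluded temporally, leaving the CAE A and CAE C errors. The new error itself appears in the result because it was already stored in the SDT.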

[0050] The manner in which the data structure for the
SCP is organized and used may vary depending upon 
system implementation. For example, the elements 
from the SDT may be copied into the SCP, and as errors 
are determined not to be related in space or in time, the
elements may be deleted from the SCP.

[0051] The algorithm preferably makes an error cor- 
relation/severity assessment in order to isolate the lo- 
cation of the failing component. Referring now to Figure 
8D, a flowchart depicts several possible cases for failing
components associated with low/medium severity errors
in accordance with a preferred embodiment of the
present invention. 

[0052] The process begins with a determination of 
whether all errors emanate from the current element
which generated the newly received error (step 830). If
so, then a determination is made as to whether two or 
more errors are in the SCP (step 831). If not, then the 
processing of the current error is complete. If so, then 
the current element is indicated to require maintenance
(step 832). The SPDIT is then interrogated to determine
if the reported error is associated with a specific part that 
should be replaced (step 833). If so, then the failing component
is called out to be replaced (step 834), and the
newly received, low/medium severity error has been
processed.

[0053] If all errors do not emanate from the current 
element, then a determination is made as to whether all 
(two or more) errors are contained in a single path (step 
835). In this case, any element in the path may be the
root cause of the reported errors, and device hardware
related errors take precedence over link related or timeout
related errors. A determination is made as to whether
the errors contain a device hardware error (step 836).
If a device hardware error is found, in a manner similar
to steps 832-834 described above, the associated ele- 
ment is indicated to require maintenance, the SPDIT is 
then interrogated to determine if the reported error is 
associated with a specific part that should be replaced, 
and If so, the failing component is called out to be re- 
placed. 

[0054] If the errors on the single path do not contain 
a device hardware error, then only link or timeout errors 
are being reported. This situation can lead to degrada- 
tion in performance and eventual failure of the link. In 
this case, the algorithm looks for the element that is re- 
porting the error first (step 837), i.e., the first error takes 
precedence and the others are assumed to be related 
to the first occurrence. Once the element that is origi- 
nating the chain of errors is found, in a manner similar 
to steps 832-834 described above, the associated ele- 
ment is indicated to require maintenance, the SPDIT is 
then interrogated to determine if the reported error is 
associated with a specific part that should be replaced, 
and if so, the failing component is called out to be re- 
placed. 
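
The precedence rules of steps 836 and 837 can be sketched as follows. The `PathError` type, its three `kind` values, and the function name are assumptions of this sketch. The description does not say how to choose among several device hardware errors on the same path, so picking the earliest one is also an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PathError:
    element: str      # hypothetical identifier of the reporting element
    kind: str         # assumed categories: "hardware", "link", "timeout"
    timestamp: float  # assumed arrival time of the error at the CAM


def pick_suspect_on_path(errors):
    """Sketch of steps 836-837 for errors that all lie on a single path."""
    if not errors:
        raise ValueError("at least one error is required")
    # Step 836: device hardware errors take precedence.
    hardware = [e for e in errors if e.kind == "hardware"]
    if hardware:
        # Assumption: among several hardware errors, take the earliest.
        return min(hardware, key=lambda e: e.timestamp).element
    # Step 837: only link or timeout errors remain, so the element that
    # reported first is treated as the origin of the chain of errors.
    return min(errors, key=lambda e: e.timestamp).element
```

The returned element would then go through the same maintenance and part call-out handling as in steps 832-834.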

[0055] If two or more errors are not contained in a sin- 
gle path, then two or more errors are occurring on mul- 
tiple paths. A determination is made as to whether there 
are any common elements on the paths of the multiple 
errors (step 838), and if so, then this case requires iso- 
lating the common element(s) on these paths (step 839) 
and performing an error correlation/severity assess- 
ment. 

[0056] The common elements can be SAN endpoint
elements, SAN fabric elements, or both. A
determination is made as to whether a SAN endpoint or fabric
element is the only common element (step 840). If so, 
then in a manner similar to steps 832-834 described 
above, this common element is indicated as failing and 
maintenance is required. The SPDIT is then interrogat- 
ed to determine if the reported error is associated with 
a specific part that should be replaced. If so, the failing 
component is called out to be replaced.

[0057] Otherwise, if a SAN endpoint or fabric element
is not the only common element, then both a SAN end- 
point and a SAN fabric element are common. This situ- 
ation is now equivalent to the result from the determina- 
tion in step 835, and the process branches to step 836 
for further processing. 
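
The multiple-path case of steps 838 to 840 can be sketched as follows. This is a hedged illustration: the function names, the representation of each path as a list of element names, and the returned outcome labels are assumptions. The sketch treats exactly one shared element as the "only common element" outcome of step 840 and treats any larger shared set as the branch back to step 836. The description only discusses the case where both an endpoint and a fabric element are shared, so extending that branch to every larger shared set is an assumption.

```python
def common_elements(paths):
    """Steps 838-839: elements that appear on every error path."""
    if not paths:
        return set()
    shared = set(paths[0])
    for path in paths[1:]:
        shared &= set(path)
    return shared


def classify_multipath(paths):
    """Return an (outcome, shared_elements) pair for errors on several paths.

    Outcome labels (all hypothetical):
      "separate"          -- no common element; each error is handled on
                             its own, as in step 841
      "common_element"    -- exactly one common element (step 840)
      "single_path_rules" -- several common elements; continue with the
                             precedence rules of step 836
    """
    shared = common_elements(paths)
    if not shared:
        return ("separate", shared)
    if len(shared) == 1:
        return ("common_element", shared)
    return ("single_path_rules", shared)
```

For example, two paths that meet only at one switch yield `("common_element", {...})` with that switch as the sole member of the set.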

[0058] If there are two or more errors that are not
contained in a single path and there are no common
elements on the paths of these errors, then each of the
multiple errors is run through the real-time diagnostic
algorithm (RDA) separately (step 841). This rare but
possible scenario may occur when more than one error has
been received within the TCW and the errors originate
from separately failing components. At this point, the
error process may branch back to step 804 to process
each error as if each error were a newly received error.

[0059] A SAN Diagnostic Table is created using the
SAN topology, native Fibre Channel services, and
vendor specific information. The present invention
preferably supports both FC native in-band and host/device
specific out-of-band status/error data collection for SAN
problem determination. A real-time diagnostic algorithm
may preferably then traverse the SAN Diagnostic Table
to isolate a failing SAN component. The methodology is
advantageous because it may be implemented on host
operating environments such that special access to
management terminals or device diagnostics is not,
according to the preferred embodiment, required to
isolate failing components. In addition, the methodology is
platform-independent, and it supports both FC native
in-band and host/device-specific out-of-band status/error
data collection for SAN problem determination.
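
One way to picture the construction of the SAN Diagnostic Table is sketched below. The layout is an assumption made for illustration and is not the table format of the preferred embodiment: the topology is taken to be a nested dictionary acting as an element-by-element connectivity matrix, the SPDIT is taken to be a dictionary keyed by element name, and the field names `links`, `pd_info`, and `errors` are hypothetical.

```python
def build_sdt(topology, spdit):
    """Combine an assumed topology matrix and SPDIT into one table.

    topology -- assumed dict: element -> {other_element: bool connected}
    spdit    -- assumed dict: element -> problem determination record
    """
    sdt = {}
    for element, row in topology.items():
        sdt[element] = {
            # Neighbours taken from the topology map.
            "links": sorted(n for n, up in row.items() if up and n != element),
            # Vendor/product data from the SPDIT; None if not registered.
            "pd_info": spdit.get(element),
            # Errors received by the CAM are appended here over time.
            "errors": [],
        }
    return sdt


def record_error(sdt, element, error, timestamp):
    """Store a received error against the reporting element."""
    sdt[element]["errors"].append((timestamp, error))
```

A diagnostic routine could then walk `sdt`, following `links` from an element that reported an error to its neighbours.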

[0060] It is important to note that while the present
invention has been described in the context of a fully
functioning data processing system, those of ordinary skill
in the art will appreciate that the processes of the
present invention are capable of being distributed in the
form of a computer readable medium of instructions and
a variety of forms and that the present invention applies
equally regardless of the particular type of signal bearing
media actually used to carry out the distribution.
Examples of computer readable media include
recordable-type media such as a floppy disc, a hard disk
drive, a RAM, and CD-ROMs, and transmission-type media
such as digital and analog communications links.



1. A method for processing errors within a storage
area network (SAN) (300), the method comprising the
computer-implemented steps of:

generating a SAN topology map;

generating a SAN problem determination
information table (SPDIT) (400); and

generating a SAN diagnostic table (SDT) using
the SAN topology map and the SPDIT.

2. The method of claim 1 wherein the SAN topology
map comprises a table in which each row of the SAN
topology table is uniquely mapped to a communication
architecture element (CAE) (302, 304, 306,
311, 312, 313, 321, 322, 323, 324, 325) and each
column of the SAN topology table is uniquely
mapped to a CAE, wherein a CAE is a network-connected
device that has successfully registered with
a communications architecture manager (CAM)
(302, 304, 308) via a network service protocol,
wherein a CAM contains problem determination
(PD) functionality for the SAN and maintains a SAN
PD information table (SPDIT), and wherein the
communication architecture (CA) comprises all
network-connected elements capable of communicating
information stored in the SPDIT.

3. The method of claim 2 wherein the SPDIT compris- 
es at least one data record associated with each 
product or element on the CA. 

4. The method of claim 3 wherein at least one data
record associated with each product or element on
the CA further comprises one or more data items
selected from the group consisting of: product
vendor information (401); product identifier information
(402); information concerning a type of
communication link supported by the product or element
(403); and/or information concerning a type of error
information to be reported by the product or element
(404).

5. The method of claim 4 wherein the type of error in- 
formation indicates whether the product or element 
supports Extended Link Services (ELS) Registered 
Link Incident Record (RLIR). 

6. The method of claim 4 or 5 wherein the SDT stores 
information from the SAN topology map and errors 
received by the CAM from CAEs. 

7. The method of any of claims 2 to 6, wherein the CA 
is managed by the CAM, the method further com- 
prising: 

receiving an error message at the communica- 
tion architecture manager (CAM); and 

processing the error message using a real-time 
diagnostic algorithm (RDA). 

8. The method of any of claims 2 to 7 wherein a net- 
work supporting the CA comprises in-band Fibre 
Channel communication links and out-of-band 
communication links. 

9. The method of any of claims 2 to 8 wherein the SAN 
comprises: 

a plurality of storage devices connected to the 
network; and 

a plurality of host computers connected to the 
network, wherein at least one of the plurality of 
host computers comprises a CAM. 

10. The method of claim 7 further comprising: 

analyzing the received error message using a
temporal correlation window (TCW) value to
temporally constrain fault isolation determination
while searching for temporally-related error
messages previously received by the CAM and
stored within the SDT; and

analyzing the received error message using a
spatial correlation path data structure (SCP) to
spatially constrain fault isolation determination
while searching for spatially-related error
messages previously received by the CAM and
stored within the SDT.

11. The method of claim 10 further comprising:

analyzing the received error message using
error severity weightings according to a type of error
indicated by the received error message.

12. A data processing system for communicating error
information in a storage area network (SAN) (300),
the data processing system comprising:

a network comprising in-band Fibre Channel
communication links and out-of-band
communication links, wherein the network supports a
communications architecture (CA);

a plurality of storage devices connected to the
network;

a plurality of host computers (302, 304, 306)
connected to the network, wherein at least one
of the plurality of host computers comprises a
communications architecture manager (CAM)
(302, 304) containing problem determination
(PD) functionality, wherein a CAM maintains a
SAN PD information table (SPDIT) (400), and
wherein the CA comprises all network-connected
elements capable of communicating
information stored in the SPDIT.

13. The data processing system of claim 12 further
comprising:

a plurality of CAMs, wherein the CA comprises
a primary CAM (304) and one or more secondary
CAMs (308), wherein a secondary CAM operates
redundantly for a primary CAM.

14. The data processing system of claim 12 or 13
wherein the CA further comprises one or more CA
elements (CAEs) and one or more CA non-participants
(CANs) (325), wherein a CAE is a network-connected
device that has successfully registered with a
CAM via a network service protocol, and wherein a
CAN is a network-connected device that has not
registered with a CAM yet is known to be present via
a SAN topology discovery process.

15. The data processing system of claim 12, 13 or 14
wherein the in-band Fibre Channel
communication links and the out-of-band
communication links are provided by a single, physical
communication link.

16. A data processing system for processing errors
within a storage area network (SAN), the data
processing system comprising:

first generating means for generating a SAN
topology map;

second generating means for generating a
SAN problem determination information table
(SPDIT); and

third generating means for generating a SAN
diagnostic table (SDT) using the SAN topology
map and the SPDIT.

17. A computer program comprising program code for
controlling a data processing system to perform
operations implementing a method for processing
errors within a storage area network (SAN) (300), the
program code comprising:

first instructions for generating a SAN topology
map;

second instructions for generating a SAN
problem determination information table (SPDIT)
(406); and

third instructions for generating a SAN
diagnostic table (SDT) using the SAN topology map and
the SPDIT.